AD-A074 082   UNCLASSIFIED
MASSACHUSETTS INST OF TECH LEXINGTON LINCOLN LAB   F/G 17/2
SPEECH ENHANCEMENT USING A SOFT-DECISION MAXIMUM LIKELIHOOD NOISE SUPPRESSION FILTER (U)
JUN 79   R J MCAULAY, M L MALPASS   F19628-78-C-0002
TN-1979-31   ESD-TR-79-163
MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 


LINCOLN  LABORATORY 


SPEECH  ENHANCEMENT  USING  A 
SOFT- DECISION  MAXIMUM  LIKELIHOOD 
NOISE  SUPPRESSION  FILTER 


R.  J.  McAULAY 
M.  L.  MALPASS 


Group  24 


TECHNICAL  NOTE  1979-31 


19  JUNE  1979 


Approved for public release; distribution unlimited.


LEXINGTON 


MASSACHUSETTS 


ABSTRACT 


One way of enhancing speech in an additive acoustic noise environment
is to perform a spectral decomposition of a frame of noisy speech and to
attenuate a particular spectral line depending on how much the measured
speech plus noise power exceeds an estimate of the background noise. Using
a two-state model for the speech event (speech absent or speech present)
and determining the maximum likelihood estimator of the speech power
results in a new class of suppression curves which permits a tradeoff of
noise suppression against speech distortion. The algorithm has been
implemented in real time in the time domain, exploiting the structure of
the channel vocoder. Extensive testing has shown that the noise can be
made imperceptible by proper choice of the suppression factor.

SPEECH ENHANCEMENT USING A SOFT-DECISION
MAXIMUM LIKELIHOOD NOISE SUPPRESSION FILTER


I. INTRODUCTION

The need for secure military voice communication has led to the
consideration of narrowband digital voice terminals. A preferred
algorithm for this task is linear-predictive coding (LPC), which has
demonstrated the ability to produce very intelligible speech with
Diagnostic Rhyme Test (DRT) scores in excess of 90% at data rates as
low as 2400 bps.[1] Unfortunately these results have been achieved
only for clean speech, whereas many of the practical environments
in which these terminals would be deployed, such as the airborne
command post or the cockpits of jet fighter aircraft and helicopters,
are characterized by a high ambient noise level, which in many cases
causes the vocoded speech to suffer a significant degradation in
intelligibility.[2] This has stimulated research into the problem
of extracting the speech parameters (pitch, buzz-hiss and spectrum)
from noisy speech in the hope that more robust algorithms could be
found.[3,4,5]

Another  approach  to  the  noisy  speech  problem  is  to  develop  a prefilter 
that  would  enhance  the  speech  prior  to  encoding  so  that  the  existing  LPC 
vocoder  could  be  applied  in  tandem  without  modification.  Two  general 
classes  of  algorithms  have  emerged:  noise  cancelling  and  noise  suppression 
prefilters. In the first case the coefficients of a tapped delay line are
adapted to produce a minimum mean squared error estimate of the noise
signal, which is then subtracted from the noisy speech waveform to effect
the noise cancellation.[6] In order to train the coefficients of the noise
cancelling filter it is usually necessary to use a second microphone to
provide a speech-free measurement of the background noise. Application of
this technique to the cancellation of E4A advanced airborne command post
noise has shown that although significant improvement in signal-to-noise
ratio (SNR) can be obtained, the improvement in intelligibility, as
measured by the Diagnostic Rhyme Test (DRT), is marginal.[7] Recent work by
Sambur[8] has attempted to exploit the periodicity of voiced speech to
eliminate the requirement for a second microphone. Thorough evaluation of
this algorithm has not yet been published.

Considerably  more  work  has  been  expended  on  the  development  of  noise 
suppression  prefilters.  In  this  approach  a spectral  decomposition  of  a 
frame  of  noisy  speech  is  performed  and  a particular  spectral  line  is 
attenuated  depending  on  how  much  the  measured  speech  plus  noise  power 
exceeds  an  estimate  of  the  background  noise  power. [9-13]  Algorithms 
using  the  FFT  have  been  tested  against  wideband  noise  and  improvements  in 
intelligibility  have  been  indicated  although  no  quantitative  results  have 
been  given. [11]  To  date,  the  attenuation  curves  have  been  proposed  on  more 
or  less  an  ad  hoc  basis,  hence  it  is  of  interest  to  determine  whether  or 
not  a more  fundamental  theoretical  analysis  could  lead  to  a new  suppression 
curve  with  substantially  different  properties.  In  the  next  section  an 
analytical model is proposed and used to determine the conditions under
which  the  existing  suppression  curves  can  be  justified.  Having  established 
a common  basis,  a new  suppression  curve  is  derived  recognizing  the  fact 
that  the  degree  of  suppression  should  be  weighted  by  the  probability  that  a 
given  measurement  corresponds  to  speech  plus  noise  or  to  noise  alone.  It 
is  shown  that  a class  of  curves  is  obtained  by  varying  the  value  of  a 
suppression  factor.  This  is  a parameter  that  can  be  chosen  to  trade  off 
noise  suppression  against  speech  distortion.  The  algorithm  has  been 
implemented  in  real  time  in  the  time  domain,  exploiting  the  structure  of 
the  channel  vocoder  to  perform  the  spectral  decomposition.  Extensive 
testing has shown that the noise can be made imperceptible by proper choice
of  the  suppression  factor. 

II.  ANALYSIS 

The prefilter design problem arises because a speech signal s(t) has
been corrupted by acoustically coupled background noise w(t) to form the
measurement y(t) = s(t) + w(t). In speech it is not easy to specify a
criterion which would lead to a "best" estimate of s(t), hence a variety of
algorithms are often proposed and evaluated by listening to the processed
results. In order to provide a common theoretical basis for relating some
of these algorithms it has been found useful to analyze the prefilter for a
frame of data of length T (T ≈ 20 msec). A further simplification occurs
by expanding y(t) in terms of a set of basis functions {$\phi_n(t)$} in such a
way that the expansion coefficients are uncorrelated random variables. If
the covariance function of y(t) is $R_y(t,u)$, then a suitable set of basis
functions is obtained from the Karhunen-Loeve expansion


$$ \lambda(n)\,\phi_n(t) = \int_0^T R_y(t,u)\,\phi_n(u)\,du, \qquad 0 \le t \le T \tag{1} $$


Then on (0,T)

$$ y(t) = \sum_{n=1}^{\infty} y_n\,\phi_n(t) \tag{2a} $$

$$ y_n = \int_0^T y(t)\,\phi_n^*(t)\,dt \tag{2b} $$


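Van Trees [14] shows that if the correlation time of y(t) is less than the
frame interval T, then an appropriate set of eigenfunctions and eigenvalues
is

$$ \phi_n(t) = \frac{1}{\sqrt{T}} \exp\left(j\,\frac{2\pi n t}{T}\right) \tag{3a} $$

$$ \lambda(n) = S_y\left(\frac{n}{T}\right) \tag{3b} $$

where

$$ S_y(f) = \int_{-\infty}^{\infty} R_y(\tau)\,\exp(-j 2\pi f \tau)\,d\tau \tag{4} $$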


is  the  power  spectrum  of  the  observed  process.  Since  a narrowband  vocoder 
[...] the perception of speech is phase insensitive, [...] for a prefilter
design is to produce the [...]

where [...] = $\sqrt{\lambda_s(n)}$, since if $\lambda_s(n)$ were known, the spectrum of
$\hat{s}(t)$ would be identical to the spectrum of s(t). Of course, it is not
known and provision must be made for estimating its value from an
observation of $y_n$ and knowledge of $\lambda_w(n)$. Since the probability density
function for the complex

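Gaussian variate $y_n$ is

$$ p(y_n) = \frac{1}{\pi[\lambda_s(n) + \lambda_w(n)]} \exp\left\{-\frac{|y_n|^2}{\lambda_s(n) + \lambda_w(n)}\right\} $$

then by maximizing with respect to $\lambda_s(n)$ the maximum likelihood
estimate of $\lambda_s(n)$ can be found to be

$$ \hat{\lambda}_s(n) = |y_n|^2 - \lambda_w(n) \tag{9} $$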


In  order  to  maintain  an  identity  system  in  the  absence  of  noise,  the  input 
phase  can  be  appended  to  the  prefilter  output  by  taking 


$$ \hat{s}_n = \left[\frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}\right]^{1/2} y_n \tag{10} $$

which  is  known  as  the  method  of  power  subtraction.  Modifications  of 
this  algorithm  have  been  studied  extensively  by  Boll  [10],  Preuss  [12] 
and  Berouti,  et  al  [13]. 

B.  Wiener  Filtering 

Whereas the power subtraction algorithm arises from an attempt to
obtain the best estimate of the speech spectrum, the Wiener filter
corresponds to the criterion of minimizing the mean squared error of the
best time domain fit to the speech waveform. Van Trees [15] has shown that
this can be done by choosing the channel coefficients to be

$$ \hat{s}_n = \frac{\lambda_s(n)}{\lambda_s(n) + \lambda_w(n)}\; y_n \tag{11} $$


Since the speech eigenvalues are unknown a priori, the maximum likelihood
estimate developed in (8) can be used in (11) to result in the suppression
rule

$$ \hat{s}_n = \left[\frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}\right] y_n \tag{12} $$


which  is  simply  the  square  of  the  suppression  rule  for  the  method  of  power 
subtraction. 

C.  Maximum  Likelihood  Envelope  Estimation 

The  previous  results  were  obtained  assuming  that  the  speech  and  the 
noise  were  independent  Gaussian  random  processes.  In  the  interest  of 
exploring the importance of this assumption an alternative model is
proposed in which the noise is a Gaussian random process while the speech is
characterized by a deterministic waveform of unknown amplitude and phase.
In this case the channel measurement is $y_n = s_n + w_n$ where now
$s_n = A \exp(j\theta)$, where A determines the speech envelope and $\theta$ its
phase. For the perception of speech an optimum estimate of its envelope is
desired since this would represent an estimate of the speech spectrum in
the nth channel. For Gaussian noise the probability density function of the
channel measurement $y_n$ is

$$ p(y_n|A,\theta) = \frac{1}{\pi\lambda_w(n)} \exp\left\{-\frac{|y_n|^2 - 2A\,\mathrm{Re}(e^{-j\theta} y_n) + A^2}{\lambda_w(n)}\right\} \tag{13} $$

To obtain the maximum likelihood estimate of A, a maximum of $p(y_n|A,\theta)$ is
sought. However the speech phase $\theta$ shows up as a nuisance parameter. Its
effect can be eliminated by maximizing the average likelihood function

$$ p(y_n|A) = \int_0^{2\pi} p(y_n|A,\theta)\,p(\theta)\,d\theta \tag{14} $$

where $p(\theta)$ is the probability density function for the phase. Since it is
reasonable to assume a uniform distribution on $(0, 2\pi)$, the likelihood
function for the spectral envelope becomes

$$ p(y_n|A) = \frac{1}{\pi\lambda_w(n)} \exp\left[-\frac{|y_n|^2 + A^2}{\lambda_w(n)}\right] \cdot \frac{1}{2\pi}\int_0^{2\pi} \exp\left[\frac{2A\,\mathrm{Re}(e^{-j\theta} y_n)}{\lambda_w(n)}\right] d\theta \tag{15} $$

The integral appearing in (15) is known as the modified Bessel function of
the first kind and is labelled

$$ I_0(|x|) = \frac{1}{2\pi}\int_0^{2\pi} \exp\left[\mathrm{Re}(e^{-j\theta} x)\right] d\theta \tag{16} $$
0 
For large values of |x| (> 3)

$$ I_0(|x|) \approx \frac{\exp(|x|)}{\sqrt{2\pi|x|}} \tag{17} $$


For  this  condition  the  likelihood  function  for  the  spectral  envelope 
becomes 


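$$ p(y_n|A) = \frac{1}{\pi\lambda_w(n)} \cdot \frac{1}{\sqrt{2\pi\,\dfrac{2A|y_n|}{\lambda_w(n)}}} \exp\left[-\frac{|y_n|^2 - 2A|y_n| + A^2}{\lambda_w(n)}\right] \tag{18} $$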


Maximizing  this  function  with  respect  to  A leads  to  the  estimator 


$$ \hat{A} = \frac{1}{2}\left[\,|y_n| + \sqrt{|y_n|^2 - \lambda_w(n)}\,\right] \tag{19} $$
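
The step from (18) to (19) is not spelled out in the text. A sketch of the intermediate algebra, writing $V = |y_n|$ and $\lambda_w = \lambda_w(n)$ for brevity (this shorthand is introduced here only for the derivation), is as follows. Dropping the factors of (18) that do not depend on A,

$$ \ln p(y_n|A) = -\tfrac{1}{2}\ln A - \frac{(V - A)^2}{\lambda_w} + \text{const.} $$

Setting the derivative with respect to A to zero gives

$$ -\frac{1}{2A} + \frac{2(V - A)}{\lambda_w} = 0 \quad\Longrightarrow\quad 4A^2 - 4VA + \lambda_w = 0 , $$

$$ A = \tfrac{1}{2}\left[\,V \pm \sqrt{V^2 - \lambda_w}\,\right] . $$

The derivative is positive between the two roots and negative beyond the larger one, so the larger root is the local maximum, which is the estimator in (19). The smaller root is a local minimum, and the growth of the approximate likelihood as A approaches zero is an artifact of the large-argument approximation (17), which does not hold there.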


As  before  the  input  phase  can  be  appended  to  this  estimate  of  the  envelope 
to  produce  the  maximum  likelihood  estimate  of  the  speech  waveform 


$$ \hat{s}_n = \left[\frac{1}{2} + \frac{1}{2}\sqrt{\frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}}\,\right] y_n \tag{20} $$
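
As a rough numerical illustration, the three suppression rules (10), (12) and (20) can each be written as a gain applied to $y_n$ and expressed as a function of the ratio r = |y_n|^2 / λ_w(n). The Python sketch below is not part of the original real-time implementation; the function names, and the clamping of a negative power estimate to zero when r < 1, are assumptions made here for illustration.

```python
import math


def power_subtraction_gain(r):
    """Gain of (10): sqrt((|y|^2 - lam_w) / |y|^2), with r = |y|^2 / lam_w.

    Clamping at zero for r < 1 is an assumption; the text does not say
    how a negative power estimate is handled.
    """
    return math.sqrt(max(r - 1.0, 0.0) / r)


def wiener_gain(r):
    """Gain of (12): the square of the power subtraction gain."""
    return max(r - 1.0, 0.0) / r


def ml_envelope_gain(r):
    """Gain of (20): 1/2 + 1/2 * sqrt((|y|^2 - lam_w) / |y|^2)."""
    return 0.5 + 0.5 * power_subtraction_gain(r)


if __name__ == "__main__":
    # Tabulate the three gains in dB for a few measured power ratios.
    for ratio_db in (1, 3, 6, 10, 20):
        r = 10.0 ** (ratio_db / 10.0)
        gains = (power_subtraction_gain(r), wiener_gain(r), ml_envelope_gain(r))
        print(ratio_db, ["%.1f dB" % (20.0 * math.log10(g)) for g in gains])
```

Note that the maximum likelihood envelope gain never falls below 1/2 (about -6 dB), even when the measured power equals the noise power, which is consistent with the observation in the next section that these rules do not suppress noise adequately when speech is absent.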

D. Two-State Soft Decision Maximum Likelihood Envelope Estimation

The  suppression  rules  for  the  power  subtraction,  Wiener  filtering  and 
maximum  likelihood  algorithms  are  illustrated  in  Fig.  1.  Their  suppression 
capabilities  were  evaluated  for  speech  in  airborne  command  post  noise  using 
a real  time  implementation  of  the  prefilter  (to  be  described  in  detail  in 
Section  III).  While  it  was  difficult  to  determine  which  algorithm  did  the 
best  job  of  extracting  the  speech  when  speech  was  present,  it  was  apparent 
that  none  of  the  algorithms  adequately  suppressed  the  background  noise  when 
speech  was  absent.  This  is  hardly  surprising  in  view  of  the  fact  that  the 
suppression  rules  were  derived  on  the  assumption  that  speech  was  always 
present  in  the  measured  data.  Had  a detector  been  used  to  determine  that  a 
given  frame  of  data  consisted  of  noise  alone,  then  obviously  a better 
suppression  rule  would  have  been  to  apply  greater  attenuation  than  indicated 
by  the  curves  in  Fig.  1.  From  this  point  of  view  it  follows  that  a better 
suppression  curve  might  evolve  if  a two  state  model  for  the  speech  event  is 
considered  at  the  outset;  that  is  either  speech  is  present  or  it  is  not. 
Mathematically  this  leads  to  the  binary  hypothesis  model: 


$H_0$: speech absent:   $|y_n| = |w_n|$

$H_1$: speech present:  $|y_n| = |A e^{j\theta} + w_n|$        (21)

Only the measured envelope is used in this measurement model since it has
already been shown that the measured phase provides no useful information in
the suppression of the noise. A useful criterion for estimating the spectral
envelope A is to choose $\hat{A}$ to minimize the mean squared spectral error
$E[(A - \hat{A})^2]$. It is well known [16] that the resulting estimator is the
conditional mean $\hat{A} = E(A|V)$, where $V = |y_n|$ is used for notational
convenience to represent the measured envelope. In this formulation the
expectation operator is used to indicate averaging over the ensemble of
noise sample functions, speech envelopes and phases and the ensemble of
speech events. The averaging for the latter case is carried out explicitly
and results in the estimator

$$ \hat{A} = E(A|V,H_1)\,P(H_1|V) + E(A|V,H_0)\,P(H_0|V) \tag{22} $$

where $P(H_k|V)$ is the probability that the speech is in state $H_k$ given that
the measured envelope has the value V. Since $E(A|V,H_0)$ represents the
average value of A given an observation V and the fact that speech is
absent, this value must be zero, hence (22) reduces to

$$ \hat{A} = E(A|V,H_1)\,P(H_1|V) \tag{23} $$

Since $E(A|V,H_1)$ represents the minimum variance estimate of A when speech
is present and since the maximum likelihood estimator is asymptotically
efficient for large SNR, it suffices to replace $E(A|V,H_1)$ by the estimator
derived in (19), hence

$$ \hat{A} = \frac{1}{2}\left[\,V + \sqrt{V^2 - \lambda_w}\,\right] P(H_1|V) \tag{24} $$


Application of Bayes rule gives

$$ P(H_1|V) = \frac{p(V|H_1)\,P(H_1)}{p(V|H_1)\,P(H_1) + p(V|H_0)\,P(H_0)} \tag{25} $$


where $p(V|H_k)$ is the a priori probability density function for the measured
envelope given the speech state $H_k$. Assuming that the speech and noise
states are equally likely (a worst case assumption),

$$ P(H_1) = P(H_0) = \frac{1}{2} \tag{26} $$


Under hypothesis $H_0$, V = |w| and since the noise is complex Gaussian with
mean zero and variance $\lambda_w$, it follows that the envelope has the Rayleigh
pdf

$$ p(V|H_0) = \frac{2V}{\lambda_w} \exp\left(-\frac{V^2}{\lambda_w}\right) \tag{27} $$


Under hypothesis $H_1$,

$$ V = |A e^{j\theta} + w| $$

and the envelope has the Rician pdf

$$ p(V|H_1) = \frac{2V}{\lambda_w} \exp\left(-\frac{V^2 + A^2}{\lambda_w}\right) I_0\left(\frac{2AV}{\lambda_w}\right) \tag{28} $$
Defining the suppression factor $\xi$ to be

$$ \xi = \frac{A^2}{\lambda_w} \tag{29} $$

and substituting (26), (27), and (28) into (25) results in the following
expression for the a posteriori probability for the presence of speech:

$$ P(H_1|V) = \frac{\exp(-\xi)\, I_0\left[2\sqrt{\xi V^2/\lambda_w}\right]}{1 + \exp(-\xi)\, I_0\left[2\sqrt{\xi V^2/\lambda_w}\right]} \tag{30} $$


It is this term which contributes the soft suppression to the maximum
likelihood envelope estimator. Appending the measured phase to the estimated
envelope in order to preserve the identity system in the absence of noise,
the final suppression rule is

$$ \hat{s}_n = \frac{1}{2}\left[1 + \sqrt{\frac{V^2 - \lambda_w}{V^2}}\,\right] P(H_1|V)\; y_n \tag{31} $$


In Fig. 2 several curves for the a posteriori probability for the speech
state $P(H_1|V)$ are illustrated for various values of the suppression factor
$\xi$. The channel gains obtained when these a posteriori probabilities are
appended to the maximum likelihood suppression rule are shown in Fig. 3.
The two-state soft-decision maximum likelihood algorithm applies
considerably more suppression when the measurement corresponds to low speech
SNR. Since this case "most likely" corresponds to noise alone, it is seen
that the effect of the residual noise (false alarms) should be considerably
reduced. When the speech SNR is large, the measured SNR will be large and
it "most likely" means that speech is present, in which case the original
maximum likelihood algorithm is the correct rule for extracting the speech
envelope.
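
The behavior described above can be checked numerically. The following Python sketch evaluates the a posteriori probability (30) and the soft-decision gain of (31) as functions of the measured ratio r = V^2/λ_w and the suppression factor ξ. It is only an illustration: the function names are hypothetical, the Bessel function is computed from its power series rather than by whatever method the original tables used, and clamping the square root at zero for r < 1 is an assumption.

```python
import math


def bessel_i0(x):
    """Modified Bessel function I0(x) from its power series.

    I0(x) = sum over k of ((x/2)^2)^k / (k!)^2; adequate for the moderate
    arguments that arise here.
    """
    q = (x / 2.0) ** 2
    term = 1.0
    total = 1.0
    k = 1
    while term > 1e-17 * total:
        term *= q / (k * k)
        total += term
        k += 1
    return total


def speech_presence_probability(r, xi):
    """P(H1|V) of (30), with r = V^2 / lam_w and suppression factor xi."""
    likelihood_ratio = math.exp(-xi) * bessel_i0(2.0 * math.sqrt(xi * r))
    return likelihood_ratio / (1.0 + likelihood_ratio)


def soft_decision_gain(r, xi):
    """Channel gain of (31): ML envelope gain weighted by P(H1|V)."""
    ml_gain = 0.5 * (1.0 + math.sqrt(max(r - 1.0, 0.0) / r))
    return ml_gain * speech_presence_probability(r, xi)


if __name__ == "__main__":
    for xi in (1.0, 5.0, 10.0):
        row = ["%.3f" % soft_decision_gain(10.0 ** (db / 10.0), xi)
               for db in (0, 5, 10, 15)]
        print("xi =", xi, row)
```

Larger values of ξ push the gain toward zero at low measured SNR while leaving it close to the maximum likelihood gain at high measured SNR, which is the tradeoff between noise suppression and speech distortion that the suppression factor is meant to control.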

III. IMPLEMENTATION

All of the noise suppression prefilters that have been reported on to
date have been implemented in the frequency domain. This corresponds
nicely to the theoretical orthogonal channel decomposition used in
Section II and exploits the properties of the FFT for filtering by circular
convolution. Since the present work evolved from an attempt to implement a
time domain Kalman filter based on a parallel formant model for speech [17],
and since a contemporary implementation of a channel vocoder is being
developed using CCD technology to produce a package which operates at rates
from 1.2 to 4.8 kbps, requires about 50 integrated circuits, occupies
0.22 cu. ft., requires 5 watts and weighs 5 lbs [18], it seemed appropriate
to attempt a time domain implementation of the prefilter that could exploit
this emerging technology. As in the channel vocoder, 19 filters are used to
span the frequency range 180 - 3720 Hz (the sampling rate was 7575 Hz).


Fig. 3. Suppression rules for maximum likelihood with soft suppression.
(Horizontal axis: speech-to-noise ratio, dB.)

Each filter in the bank is a result of a bandpass transformation of a
second order Butterworth filter. The center frequencies and the bandwidths
for each of the filters in the bank are listed in Table I and a plot of
their linear magnitudes is shown in Fig. 4.

TABLE I

CHANNEL FILTER SPECIFICATIONS

CHANNEL     CENTER             3 dB
NUMBER      FREQUENCY (Hz)     BANDWIDTH (Hz)

   0           240                120
   1           360                120
   2           480                120
   3           600                120
   4           720                120
   5           840                120
   6           975                150
   7          1125                150
   8          1275                150
   9          1425                150
  10          1575                150
  11          1750                200
  12          1950                200
  13          2150                200
  14          2350                300
  15          2600                300
  16          2900                300
  17          3200                300
  18          3535                370

Sampling interval = 132 µsec (7575 Hz sampling rate)



Although theory requires that the channels be orthogonal, in practice
overlapping filters provide for spectral smoothing, which is known to be an
important factor in the design of noise suppression systems.[11] The
filters in the channel vocoder were originally chosen to provide a good
compromise for smoothing the envelope of the speech spectrum, hence their
lack of orthogonality turns out to be an asset in this particular case.

Since  the  19  filters  span  the  frequency  range  of  the  speech  signal,  the 
front  end  of  the  channel  vocoder,  in  the  absence  of  noise,  represents  an 
identity  system  provided  the  outputs  of  each  of  the  channels  are  added 
alternately  out  of  phase,  as  shown  in  the  block  diagram  in  Fig.  5. 

In  order  to  compute  the  channel  gains,  measurements  must  be  made  to 
determine  the  instantaneous  signal  power  and  the  average  noise  power  at  the 
output  of  each  of  the  channel  filters.  Since  the  speech  parameters  change 
very  little  in  20  ms,  some  temporal  smoothing  can  be  exploited  by  computing 
the  signal  power  from 

$$ V_n^2 = \frac{1}{N}\sum_{k=1}^{N} y_n^2(k) \tag{32} $$

where $y_n(k)$ represents the signal sample out of the nth channel at time k,
where there are N such samples in the 20 ms frame (the normalization by N
will be unnecessary). Determination of the background noise power requires
knowledge of whether or not a particular frame contains speech. One
approach to making this determination has been developed by Roberts [19]
who noted that a 4-sec histogram of the frame energies of the input signal
was bimodal. He found that by setting a detection threshold between the
modes, correct speech and noise classification could be made most of the
time.  A modification  of  this  algorithm  which  is  described  in  detail  in  the 
appendix,  was  used  to  determine  with  high  confidence  the  frames  that  were 
absent  of  speech.  For  those  frames  the  average  noise  power  in  each  channel 
was  estimated  by  smoothing  the  measurements  in  (32)  using  a 1-sec  time 
constant.  The  major  drawback  of  this  algorithm  was  the  relatively  long 
adaptation  time  needed  to  determine  the  detection  threshold  and  then  the 
additional  training  period  required  to  learn  the  channel  noise  powers.  An 
alternative  scheme,  which  eliminates  the  requirement  for  a noise  detector 
has  been  proposed  by  Paul. [4]  In  this  scheme  the  average  channel  noise 
power  is  updated  after  every  frame  according  to  the  rule 

$$ \lambda_w(m) = \lambda_w(m-1) + \alpha(m)\left[V^2(m) - \lambda_w(m-1)\right] \tag{33} $$

where $V^2(m)$ and $\lambda_w(m)$ represent the measured power and the average noise
power computed for the mth frame. The averaging time constant is
controlled by $\alpha(m)$ and is chosen adaptively to correspond to a 1 sec time
constant if $V^2(m) > \lambda_w(m-1)$ and to a 100 ms time constant if
$V^2(m) < \lambda_w(m-1)$. As a result of this adaptivity the circuit rapidly adapts
the noise power to the value of $V^2(m)$ whenever there is a speech gap, while
during connected speech the noise level increases slowly enough that the
noise power will not take on the attributes of the speech. Since this rule
is easy to compute and can track nonstationary noise, its performance
warranted comparison with Roberts' noise detector.
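
The update rule (33) with its two time constants can be sketched in a few lines of Python. This is an illustration only, not the original machine-language routine: the function names are hypothetical, the 20 ms update interval is assumed from the frame length quoted above, and the mapping from a time constant to the coefficient α(m) through a first-order exponential is an assumption, since the text does not state how α(m) was computed.

```python
import math


def smoothing_coefficient(frame_step_s, time_constant_s):
    """Assumed first-order mapping from a time constant to alpha."""
    return 1.0 - math.exp(-frame_step_s / time_constant_s)


def update_noise_power(noise_prev, v2, frame_step_s=0.02,
                       slow_tc_s=1.0, fast_tc_s=0.1):
    """One step of (33).

    noise_prev : average noise power from the previous frame, lam_w(m-1)
    v2         : measured channel power for the current frame, V^2(m)
    A rising measurement (possible speech) is followed with the slow 1 sec
    time constant; a falling one (speech gap) with the fast 100 ms one.
    """
    time_constant = slow_tc_s if v2 > noise_prev else fast_tc_s
    alpha = smoothing_coefficient(frame_step_s, time_constant)
    return noise_prev + alpha * (v2 - noise_prev)


if __name__ == "__main__":
    # A burst of "speech" (power 10) over a noise floor of 1.
    powers = [1.0] * 5 + [10.0] * 25 + [1.0] * 25
    estimate = 1.0
    for m, v2 in enumerate(powers):
        estimate = update_noise_power(estimate, v2)
        if m % 5 == 4:
            print(m, round(estimate, 3))
```

In this sketch the estimate creeps upward during the burst and falls back toward the noise floor roughly ten times faster once the burst ends, which is the asymmetry the rule relies on.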

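Using the measurement of $V^2(m)$ and the estimated average value of
$\lambda_w(m-1)$, the gain factor

$$ g(m) = \frac{V^2(m) - \lambda_w(m-1)}{V^2(m)} \tag{34} $$

is computed for each channel.* Since $V^2(m)/\lambda_w(m-1)$ can be expressed in
terms of g(m), the noise suppression rule, (30) and (31), can be written
as

$$ G(m) = \frac{1}{2}\left[1 + \sqrt{g(m)}\,\right] \frac{\exp(-\xi)\, I_0\left[2\sqrt{\dfrac{\xi}{1 - g(m)}}\,\right]}{1 + \exp(-\xi)\, I_0\left[2\sqrt{\dfrac{\xi}{1 - g(m)}}\,\right]} \tag{35} $$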


The advantage in using g(m) as the independent variable is the fact that
$0 \le g(m) \le 1$, which permits the use of a simple software divide routine in
forming the normalization. For a given value of the suppression factor, $\xi$,
the measured gain g(m) is used as a pointer for a table look-up to
determine the attenuation prescribed by (35). Fifteen tables corresponding to
values of $\xi$ = 1, 2, 3, ..., 15 have been included in the prefilter with each
table consisting of 50 values of the suppression rule computed for equal
increments of g(m) from 0 to 1. No attempt was made to optimize the design
of these tables. All of the coding was done in machine language on the
LDVT [19] which has the ability to key in a new value of the suppression
factor in real time. This meant that the prefilter could easily be adjusted
to accommodate a wide class of operational environments. This turned out
to be a significant capability for effective noise suppression. Since the
algorithm was designed to operate in real time, a 10 ms delay had to be
incurred between the time the energies were measured and the time the
corresponding gains could be computed and applied to the channel waveforms.
This was done by computing the energies (block floating point) in 10 ms
segments and adding consecutive segments together to produce the desired
20 ms energy measurement. This permitted computation of the raw gains,
G(m), every 10 ms. In order to avoid the introduction of discontinuities
in the output waveform the final output is a smoothed gain $\bar{G}(m)$ obtained
according to

$$ \bar{G}(m) = \bar{G}(m-1) + \beta(m)\left[G(m) - \bar{G}(m-1)\right] \tag{36} $$

*This is where the normalization by N in (32) disappears.
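
The table look-up described in this section (fifteen tables of 50 entries, indexed by the gain factor of (34)) can be sketched as follows. This Python fragment is an illustration under stated assumptions, not a reconstruction of the LDVT code: the names are hypothetical, the table entries are assumed to sit at g = i/49 for i = 0, ..., 49, the index is assumed to be found by rounding, and the entry at g = 1 is taken as the limiting gain of 1.

```python
import math

TABLE_SIZE = 50            # entries per table, as stated in the text
XI_VALUES = range(1, 16)   # suppression factors 1, 2, ..., 15


def bessel_i0(x):
    """Modified Bessel function I0(x) from its power series."""
    q = (x / 2.0) ** 2
    term = 1.0
    total = 1.0
    k = 1
    while term > 1e-17 * total:
        term *= q / (k * k)
        total += term
        k += 1
    return total


def suppression_rule(g, xi):
    """Gain as a function of g = (V^2 - lam_w) / V^2, using V^2/lam_w = 1/(1-g)."""
    if g >= 1.0:
        return 1.0  # limit as the measured SNR grows without bound
    g = max(g, 0.0)
    likelihood_ratio = math.exp(-xi) * bessel_i0(2.0 * math.sqrt(xi / (1.0 - g)))
    probability = likelihood_ratio / (1.0 + likelihood_ratio)
    return 0.5 * (1.0 + math.sqrt(g)) * probability


# Precompute one table per suppression factor (assumed grid: g = i / 49).
TABLES = {
    xi: [suppression_rule(i / (TABLE_SIZE - 1), float(xi))
         for i in range(TABLE_SIZE)]
    for xi in XI_VALUES
}


def lookup_gain(g, xi):
    """Use the measured gain factor g as a pointer into the table for xi."""
    g = min(max(g, 0.0), 1.0)
    index = int(round(g * (TABLE_SIZE - 1)))
    return TABLES[xi][index]


if __name__ == "__main__":
    for xi in (1, 5, 15):
        print(xi, [round(lookup_gain(g, xi), 3) for g in (0.0, 0.5, 0.8, 0.95)])
```

Because g(m) is confined to the interval from 0 to 1, a fixed 50-entry table covers the whole range of measured SNR, and changing the suppression factor at run time amounts to selecting a different table.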


Since the introduction of smoothing can cause the prefilter to be slow to
respond to a leading edge transition, which could result in speech
distortion, the gain $\beta(m)$ in (36) is chosen adaptively according to the rule

$$ \beta(m) = \begin{cases} 1 & \text{if } G(m) > \bar{G}(m-1) \\ \gamma & \text{if } G(m) < \bar{G}(m-1) \end{cases} \tag{37} $$
In this way the prefilter responds immediately to an increase in the SNR,
which should minimize the potential for leading edge distortion. During a
trailing edge, in which the gain will be decreasing, the smoothed gain will
be used, which will tend to maintain the speech signal even though the noise
becomes dominant. It is the smoothed gain $\bar{G}(m)$ determined by (36) and
(37) that is applied to the waveform at the output of each of the channel
filters. These waveforms were then added together alternately 180° out of
phase to produce the prefilter output waveform $\hat{s}(t)$.
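
The asymmetric gain smoothing of (36) and (37) can be illustrated with the short Python sketch below. The function name is hypothetical and the value gamma = 0.5 is chosen arbitrarily for the example; the text does not give the value used in the prefilter.

```python
def smooth_gain(raw_gain, prev_smoothed, gamma=0.5):
    """One step of (36) with the adaptive coefficient of (37).

    A rising raw gain (leading edge) is passed through immediately;
    a falling raw gain (trailing edge) is followed with coefficient gamma.
    gamma = 0.5 is an illustrative assumption, not a value from the text.
    """
    beta = 1.0 if raw_gain > prev_smoothed else gamma
    return prev_smoothed + beta * (raw_gain - prev_smoothed)


if __name__ == "__main__":
    raw_gains = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
    smoothed = 0.1
    for raw in raw_gains:
        smoothed = smooth_gain(raw, smoothed)
        print(raw, round(smoothed, 4))
```

In the printed sequence the smoothed gain jumps to 0.9 as soon as the raw gain does, then decays gradually after the raw gain drops, which is the leading-edge and trailing-edge behavior described above.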


IV.  EXPERIMENTAL  RESULTS  AND  CONCLUSIONS 

Since  the  prefiltering  algorithm  operated  in  real  time  it  was  possible 
to  perform  extensive  listening  tests  on  a large  speech  and  noise  data  base. 
It  was  of  particular  interest  to  determine  the  operational  performance  of 
the prefilter in conjunction with a 2400 bps vocoder operating in a
background of E4A advanced airborne command post noise (ACPN). Source tapes
were  available  for  this  environment  consisting  of  lists  spoken  by  six  male 
speakers  for  which  a DRT  score  and  a diagnostic  acceptability  measure  (DAM) 
could  be  computed.  The  recordings  were  made  using  both  a high  quality 
Altec  microphone  and  a noise  cancelling  microphone. 

The  first  experiment  consisted  of  listening  to  the  output  of  the 
prefilter  for  various  values  of  the  suppression  factor.  It  was  always 
possible  to  select  a suppression  factor  which  would  render  the  background 
noise  imperceptible,  although,  for  cases  in  which  the  SNR  was  low  enough, 
the  cost  in  doing  this  was  the  introduction  of  various  degrees  of  speech 
distortion. In these cases, if the suppression factor was subsequently
reduced,  the  speech  distortion  was  reduced  at  the  expense  of  introducing  a 
perceptible  level  of  background  noise. 

In  the  next  experiment  the  prefilter  was  connected  in  tandem  with  the 
2400 bps LPC vocoder which used the Gold-Rabiner pitch estimator.[21,22]

An  unexpected  result  was  obtained.  If  the  suppression  factor  was  set  to 
remove  the  residual  noise  at  the  output  of  the  prefilter  then  the  speech 
quality  at  the  vocoder  output  was  poor  due  to  both  buzz-hiss  errors  and 
spectral  distortion.  If,  however,  the  suppression  factor  was  chosen  so 
that  the  noise  at  the  vocoder  output  was  negligible,  then  a significantly 
lower  value  of  the  suppression  factor  was  needed  and  the  speech  quality  was 
quite  good,  although  the  Gold-Rabiner  algorithm  continued  to  make  buzz-hiss 
errors, but at a lower rate. In other words, LPC itself has some
suppression capabilities against weak noise which can usefully be exploited in the
tandem  connection.  It  was  the  flexibility  in  selecting  the  prefilter 
suppression  factor  which  made  this  result  possible. 

Since  the  deployment  of  the  LPC  vocoder  does  allow  for  flexibility  in 
the  specification  of  the  pitch  extractor,  it  was  of  interest  to  determine 
whether  or  not  algorithms  that  were  specially  designed  to  operate  in  noise 
would  operate  more  effectively  in  the  tandem  connection.  Such  an  algorithm, 
based on maximum likelihood estimation techniques, has been under
development for some time [23] and was chosen to be tested against the
Gold-Rabiner algorithm. In the subjective listening tests it was found that,
indeed, smoother pitch tracks could be obtained with a lower rate of
buzz-hiss errors.

Although using the prefilter always produced subjectively more
pleasant-sounding speech, since the annoying and tiresome background noise
was removed, it was important to determine
whether or not there was a corresponding quantitative improvement in
intelligibility. To do this DRT scores are being obtained for the
prefiltered speech and the speech out of the LPC tandem for both the
Gold-Rabiner and the maximum likelihood pitch extraction algorithms. Results
are currently being obtained for both the Altec dynamic microphone and the
confidencer noise cancelling microphone and will be reported once all of
the data have been collected and analyzed.

So  far  the  focus  has  been  on  the  19-channel  prefilter  based  on  the 
principles  of  channel  vocoder  design.  This  was  strictly  a pragmatic  choice 
which  was  made  to  facilitate  the  development  of  a real-time  testbed. 
Questions  relating  to  the  number  of  filters,  the  bandwidths  and  the  choice 
of  center  frequencies  remain  to  be  addressed.  Although  the  time  domain 
structure of the channel prefilter is well suited to an analog
implementation using CCD technology, it is of interest to determine the tradeoffs
with respect to a frequency domain approach using the FFT. Whatever
candidate system is chosen for evaluation, using the class of suppression
rules developed in this study allows the overall design to be optimized
with respect to the noise suppression/speech distortion tradeoff by choosing
an  appropriate  suppression  factor.  In  this  way  performance  differences  can 
be  attributed  to  the  system  design  parameters  independent  of  a particular 
suppression  rule  which  may  have  represented  a poor  choice  for  the  particular 
signal  and  noise  conditions  used  in  the  evaluation. 

APPENDIX


MODIFIED  ROBERTS  NOISE  DETECTION  ALGORITHM 

In order to estimate the statistics of the background noise, it is
desirable to inspect only those frames of data which have a high probability
of containing no speech. To accomplish this, an adaptive energy
threshold marking the probable boundary between noise and noise plus speech
is established by monitoring the energy on a frame by frame basis and
maintaining energy histograms which reflect the bimodal distribution of the
energy. The flow chart for the algorithm, shown in Fig. 6, is described in
the following paragraphs.

For each frame the sum of the squares of the input samples is computed.
If this energy does not exceed 16 bits (i.e., does not strongly imply the
presence of speech), the adaptive threshold algorithm is exercised. First,
a decay factor of .995 is applied to a 128-bin histogram of uniform ranges
of energy, causing exponential decay of the histogram values with a time
constant of 4 seconds. The value of the bin which encompasses the energy
of the current frame is then incremented by 160. A typical energy histogram
after adaptation is complete is shown in Fig. 7a.
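
The histogram update described above can be sketched as follows. This is a minimal illustration, not the real-time implementation: the reading of the energy ceiling as 2^16, the choice of bin width (the ceiling divided evenly over the 128 bins), and all function and constant names are assumptions made for the example.

```python
import numpy as np

NUM_BINS = 128            # histogram size given in the text
DECAY = 0.995             # per-frame decay factor given in the text
INCREMENT = 160           # score added to the bin of the current frame
SPEECH_CEILING = 2 ** 16  # assumed meaning of "does not exceed 16 bits"
BIN_WIDTH = SPEECH_CEILING / NUM_BINS  # assumed uniform bin width

def frame_energy(samples):
    """Sum of the squares of the input samples in one frame."""
    x = np.asarray(samples, dtype=np.float64)
    return float(np.sum(x * x))

def update_histogram(hist, energy):
    """Decay every bin, then credit the bin that contains `energy`.

    Modifies `hist` in place. Returns False, leaving `hist` untouched,
    when the energy is large enough to strongly imply speech.
    """
    if energy >= SPEECH_CEILING:
        return False
    hist *= DECAY                                   # exponential forgetting
    bin_index = min(int(energy // BIN_WIDTH), NUM_BINS - 1)
    hist[bin_index] += INCREMENT
    return True
```

If every frame updates the histogram, the total area under it approaches 160/(1 - .995) = 32000, so under these assumptions the scores stay within a 16-bit signed word.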

A second 128-point cumulative histogram is then formed to represent
the area under the first histogram by computing the accumulated scores from
the lowest energy bin through each higher energy bin. Fig. 7b shows the
result of accumulating the scores from the histogram in Fig. 7a. If the 10th
point of the second histogram exceeds 25% of the total area, it is assumed
that there is no noise present.
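
The cumulative histogram and the noise-presence test can be sketched as below. The function name, the treatment of the "10th point" as index 9 (counting from one), and the handling of an empty histogram are assumptions made for illustration.

```python
import numpy as np

def noise_present(hist, low_point=10, low_fraction=0.25):
    """Return (present, cumulative) for an energy histogram.

    `cumulative[k]` is the accumulated score from the lowest bin through
    bin k, so `cumulative[-1]` is the total area under the histogram.
    """
    cumulative = np.cumsum(hist)
    total = cumulative[-1]
    if total <= 0:
        # Assumed convention: no scores yet, so no evidence of noise.
        return False, cumulative
    # Noise is taken to be absent when the lowest `low_point` bins
    # already hold more than `low_fraction` of the total area.
    present = cumulative[low_point - 1] <= low_fraction * total
    return bool(present), cumulative
```

For example, a histogram whose scores all fall in bin 40 is classified as containing noise, whereas one with half of its area in bin 0 is not.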

Fig. 6. Modified Roberts noise detection algorithm.

If noise is present, a search is made through the second histogram for
the point which represents 80% of the total area. The quantum of energy
corresponding to this point becomes the new threshold candidate. If this
candidate exceeds the current threshold, the threshold is updated using a
decay factor of .95, corresponding to a slow time constant of 400 ms. If the
candidate is below the current threshold, the threshold is updated with a
decay factor of .60653, corresponding to a fast time constant of 40 ms.
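
The decay factors and time constants quoted in this appendix are related as shown below. The frame period T = 20 ms is not stated in this appendix; it is assumed here because it is the value that makes all three quoted pairs consistent.

```latex
% a: per-frame decay factor, T: frame period (assumed 20 ms), tau: time constant
\begin{align*}
a^{\tau/T} = e^{-1} &\;\Longrightarrow\; \tau = -\frac{T}{\ln a}, \\
a = 0.995 &:\quad \tau \approx \frac{20~\text{ms}}{0.00501} \approx 3990~\text{ms} \approx 4~\text{s}, \\
a = 0.95 &:\quad \tau \approx \frac{20~\text{ms}}{0.0513} \approx 390~\text{ms} \approx 400~\text{ms}, \\
a = 0.60653 \approx e^{-1/2} &:\quad \tau \approx \frac{20~\text{ms}}{0.5} = 40~\text{ms}.
\end{align*}
```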

If noise is absent, the new threshold candidate is set to zero, and
the threshold is updated using a decay factor of .60653, a fast time
constant of 40 ms. Finally, the threshold is held to a minimum of 1024 to
guarantee updating of the estimated noise components when background noise
suddenly disappears.
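
The candidate search and the asymmetric threshold update can be sketched as follows. The text does not spell out the update equation; a first-order recursive average, threshold = a * threshold + (1 - a) * candidate, is assumed here, as is the use of the upper bin edge as the "quantum of energy". All names are hypothetical.

```python
import numpy as np

SLOW_DECAY = 0.95       # candidate above the current threshold
FAST_DECAY = 0.60653    # candidate below the threshold, or noise absent
MIN_THRESHOLD = 1024.0  # floor given in the text

def threshold_candidate(cumulative, bin_width, present, fraction=0.80):
    """Energy at the point holding `fraction` of the total area."""
    if not present:
        return 0.0                      # noise absent: candidate is zero
    target = fraction * cumulative[-1]
    # First bin whose accumulated score reaches the target area.
    index = int(np.searchsorted(cumulative, target, side="left"))
    return (index + 1) * bin_width      # assumed: upper edge of that bin

def update_threshold(threshold, candidate):
    """Rise slowly, fall quickly, and never drop below the floor."""
    decay = SLOW_DECAY if candidate > threshold else FAST_DECAY
    smoothed = decay * threshold + (1.0 - decay) * candidate
    return max(smoothed, MIN_THRESHOLD)
```

With this form the threshold follows an increase in the noise level slowly, which guards against tracking speech onsets, and follows a decrease quickly, while the floor keeps the noise estimate updating after the background noise disappears.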